A voice and speech corpus of patients who underwent upper airway surgery in pre- and post-operative states

Many research articles have explored the impact of surgical interventions on voice and speech evaluations, but advances are limited by the lack of publicly accessible datasets. To address this, a comprehensive corpus of 107 Spanish Castilian speakers was recorded, including control speakers and patients who underwent upper airway surgeries such as Tonsillectomy, Functional Endoscopic Sinus Surgery, and Septoplasty. The dataset contains 3,800 audio files, averaging 35.51 ± 5.91 recordings per patient. This resource enables systematic investigation of the effects of upper respiratory tract surgery on voice and speech. Previous studies using this corpus have shown no relevant changes in key acoustic parameters for sustained vowel phonation, consistent with initial hypotheses. However, the analysis of speech recordings, particularly nasalised segments, remains open for further research. Additionally, this dataset facilitates the study of the impact of upper airway surgery on speaker recognition and identification methods, and testing of anti-spoofing methodologies for improved robustness.

Participants. Speech data from 107 Spanish Castilian speakers (56 women, 51 men) were systematically recorded over two years at the Otorhinolaryngology Service of Hospital Universitario de Fuenlabrada, Spain. All participants had planned surgeries and were recorded at three time points: 15 days before surgery (Ses. 1), 15 days post-surgery (Ses. 2), and 3 months post-surgery (Ses. 3). Inclusion criteria for controls encompassed adults over 18 years of age scheduled for minor otorhinolaryngology surgeries, excluding those with previous vocal tract surgery, neck cancer, or disorders related to speech or voice. All participants underwent a clinical examination, including oral cavity and fiberoptic naso-endoscopic evaluations, to confirm inclusion criteria. Patients who had pathological GRBAS (i.e., GRBAS > 0) or presented noticeable pathology in the vocal folds (e.g., lesions, scars, reflux) observed during nasofibroscopy were excluded from the study.
The total number of subjects is limited by the availability of patients received at the ENT service of Hospital de Fuenlabrada with the aforementioned pathologies. Moreover, the sample size was determined to ensure significance in the previous studies 24,27,28. The sessions were defined to obtain a baseline recording as close as possible to the surgical procedure, to evaluate the mid-term effect of the surgery, and to ensure that the patient had completely recovered from it.
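The three-session schedule can be illustrated with a short sketch. It derives the nominal recording dates from a surgery date, using the offsets stated in the protocol. The function name, the dictionary keys, the example date, and the approximation of "3 months" as 90 days are assumptions made for illustration and are not part of the corpus tooling.

```python
from datetime import date, timedelta

# Nominal offsets from the protocol; "3 months" is approximated as 90 days,
# which is an assumption of this sketch.
SESSION_OFFSETS_DAYS = {"Ses1": -15, "Ses2": 15, "Ses3": 90}


def nominal_session_dates(surgery_date):
    """Return the nominal recording date of each session for one patient."""
    return {
        session: surgery_date + timedelta(days=offset)
        for session, offset in SESSION_OFFSETS_DAYS.items()
    }


# Hypothetical surgery date, used only for illustration.
schedule = nominal_session_dates(date(2020, 3, 16))
for session, day in schedule.items():
    print(session, day.isoformat())
```

Actual recording dates may deviate from these nominal values, so the sketch is only a planning aid and not a description of the recorded timestamps.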
The corpus comprises data from four distinct cohorts, categorised according to the type of surgery performed, all conducted by the same surgeon. Three of the four cohorts underwent supraglottal tract surgeries, while the fourth had a minor surgery unrelated to the vocal tract:
• Tonsillectomy (Tonsill.): Involves the removal of inflamed or infected tonsils, usually performed in patients with recurrent tonsillitis who have not responded to other treatments. The average grade of tonsillitis for these participants was 3.1 on a scale of 0 to 4 30. The dataset contains 25 patients belonging to this group.
• FESS (FESS.): A procedure to address nasal polyposis and chronic rhinosinusitis, which can cause nasal obstruction. This surgery involves the removal of inflamed nasal mucosal tissue and ethmoid bone cells to improve nasal airflow and reduce rhinolalia. This cohort contains 27 patients.
• Septoplasty (Sept.): An intranasal procedure that corrects septal cartilage shape deformities, which can create airflow resistance during breathing and nasal ventilation. The dataset contains 29 patients in this group.
• Minor surgery (Contr.): This group underwent minor repair surgeries unrelated to, and not affecting, the voice, speech, or vocal tract, thus serving as the control group. This cohort contains 26 patients.
Demographic data for each patient were collected systematically during Ses. 1. These data included the patient's age (an integer), self-declared gender (a free string entry), and height in centimetres.
Furthermore, the data included the patient's diagnosis, documented as a free-text field. Smoking habits were stored in a binary field, and the presence of OSA was likewise marked in a binary entry. The use of CPAP therapy, where applicable, was also documented. The professional use of voice (i.e., whether voice is essential to the patient's job) was saved in a binary "True/False" field. The surgery date was recorded as a datetime field (for the Contr. group, the date, when present, refers to a minor surgery not related to voice or speech). Additionally, the doctor's comments were stored in a free-text field written in Spanish. The Tonsill. cohort had specific information on tonsillar grade, while the FESS. and Sept. cohorts had data related to Lund-Mackay scoring.
Missing data were observed only in the last four columns: surgery date, doctor's comments, tonsillar grade, and Lund-Mackay score. Table 1 indicates the percentage of missing data in these columns for each cohort.
Clinical Data. At the beginning of each session, detailed clinical data were collected from each patient, including the patient's weight. In addition, nasometry assessments were performed to measure nasality during the sustained vowel /eh:/. For these measurements, the Nasometer II model 6450 was used, as seen in Fig. 1. Nasality was assessed as the ratio of acoustic energy that originates in the nasal tract compared to that of the oral tract. It was used as an indicator of the extent of the velopharyngeal opening during phonation. Lower values were associated with reduced nasality (hyponasalance), while higher values were indicative of increased nasality (hypernasalance). A higher nasalance score is expected after surgery (going from hyponasal to normal nasality).
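The energy ratio behind a nasalance score can be sketched as follows. This sketch assumes the conventional definition, in which nasal energy is divided by the sum of nasal and oral energy and expressed as a percentage, so the score stays between 0 and 100. Whether the Nasometer applies exactly this formula, and any filtering it performs beforehand, is not stated here and is not modelled. The function names and the toy samples are hypothetical.

```python
def energy(samples):
    """Sum of squared sample values (signal energy)."""
    return sum(x * x for x in samples)


def nasalance_percent(nasal, oral):
    """Nasal energy over total (nasal + oral) energy, in percent."""
    e_nasal = energy(nasal)
    e_oral = energy(oral)
    total = e_nasal + e_oral
    if total == 0:
        raise ValueError("both channels are silent; nasalance is undefined")
    return 100.0 * e_nasal / total


# Toy samples standing in for the nasal and oral microphone channels.
nasal_channel = [0.1, -0.1, 0.1, -0.1]
oral_channel = [0.3, -0.3, 0.3, -0.3]
score = nasalance_percent(nasal_channel, oral_channel)
print(round(score, 1))
```

With this definition, a blocked nasal passage lowers the nasal-channel energy and therefore the score, which matches the expectation that nasalance rises after surgery.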
Furthermore, a nasality questionnaire 9, previously validated in the Spanish language (following a methodology similar to that of other adaptations 31,32), was administered to gauge subjective perception of nasality. This questionnaire consisted of 13 items related to nasal symptoms, each of which was rated on a scale of 0 to 4. The questionnaire is available in Appendix A. The cumulative scores from these items were used to derive a final total score included in the dataset. Table 2 presents the statistics for nasometry, the nasality questionnaire, and weight for the different cohorts and sessions. Furthermore, Fig. 2 shows the box plots of the nasalance and nasality values obtained for each pathology and for the three recorded sessions. The box plots in Fig. 2 (left) correspond to the nasalance values obtained with the nasometer device, and those in Fig. 2 (right) correspond to the nasality values calculated from the self-assessment questionnaire presented to the patients.
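The total questionnaire score is a plain sum of the item scores, which a minimal sketch can make explicit. The number of items and the 0 to 4 range are taken from the description above. The function name and the example answers are hypothetical, and the bounds are kept as constants so that they can be adapted if the distributed scores use a different offset.

```python
N_ITEMS = 13                   # number of questionnaire items
MIN_SCORE, MAX_SCORE = 0, 4    # per-item rating scale


def questionnaire_total(item_scores):
    """Validate the per-item ratings and return their sum."""
    if len(item_scores) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} items, got {len(item_scores)}")
    for score in item_scores:
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"item score out of range: {score}")
    return sum(item_scores)


# Invented answers for one patient and one session.
answers = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2]
total = questionnaire_total(answers)
print(total)
```

Under these bounds the total ranges from 0 to 52, with higher totals indicating more severe self-reported nasal symptoms.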
Additionally, a GRBAS evaluation was performed during each visit as a subjective assessment of the quality of the patient's voice. Contr. speakers who showed pathological GRBAS scores or had evident pathology in the vocal folds observed during nasofibroscopy were excluded from the study. Table 3 presents the statistics for subjective evaluations of GRBAS for the different groups and sessions. Furthermore, Fig. 3 shows the evolution of the null values obtained for the GRBAS score for each pathology and for the three recorded sessions.
Audio registration. The recording protocol, equipment, and environment were the same in all three sessions and across all groups. This meticulous approach was intended to produce high-quality data for subsequent analysis.
The audio recordings were made in a soundproof, acoustically isolated, and carefully conditioned room, with a reverberation time consistently maintained below 0.25 seconds.
The audio equipment used for the recordings included an AKG® C420 headset microphone, with recordings sampled at 44,100 Hz. This microphone was connected to a 24-bit Soundblaster® Live sound card, which, in turn, was connected to a personal computer equipped with PRAAT® software 33.
During the recording sessions, special attention was paid to the comfort of the patients. They were asked to speak at a comfortable pitch and loudness to obtain the most natural and representative samples.
Script Creation. A specific protocol was designed to collect the voice and the speech of the speakers, ensuring a diverse range of vocal sounds and articulations, including sustained vowels, specific phrases, and spontaneous speech. The following acoustic material was recorded:
• Sustained vowel /a/: The patients were asked to phonate the sustained vowel /a/ three times at a comfortable pitch and loudness, each lasting more than approximately 2 seconds.
• Sustained vowels /a/, /e/, /i/, /o/, and /u/: Patients were asked to phonate these five different sustained vowels, each with a duration of approximately 1 second, at a comfortable pitch and loudness, and with short pauses to breathe between each vowel.
• TDU: Patients were instructed to recite four TDUs lasting approximately 10 seconds. These sentences are phonetically balanced and contain several nasalised sounds. The utterances corresponding to the following sentences (in Spanish) were recorded:
  • "Corre agua en el arroyo al crepúsculo", which in the International Phonetic Alphabet 34 (IPA) corresponds to: ['ko ře 'a Ɣ wa en el a 'řo yo al kre 'pus ku lo]
  • "Calienta la casa el brasero y el hornillo de carbón", which in the IPA is: [ka 'ljen ta la 'ka sa el βra 'se ro i el or 'ni Lo ðe kar 'βon]
  • "Es hábil un solo día", which in the IPA is: [es 'a βil un 'so lo 'ði a]
  • "La mesa tiene ocho patas", which in the IPA is: [la 'mesa 'tjene 'ot∫o 'patas]
• Free monologue: As a final part of the recording process, patients were encouraged to describe a predetermined illustration depicting various activities of daily living, such as taking a shower or cleaning the house, as shown in Fig. 4. This free monologue was recorded for 1 minute. Specialists transcribed the monologues manually, and the transcriptions are available in both raw and clean formats, the latter carefully edited to exclude interjections, coughing, and background noises during silence periods.
Table 3. GRBAS values by cohort and session expressed as mean ± standard deviation.
Fig. 3 Evolution of the percentage of null GRBAS values per session. A clear increase after the surgery is shown.

Ethics declaration.
The study was approved by the Ethics Review Board of the Hospital Universitario de Fuenlabrada (IRB: 18/11) in accordance with the Spanish Ethical Review Act. All participants completed the questionnaire and provided their written consent to participate in the study. Patients were individually identified with a code, which is different from the one used in the Hospital for their clinical histories, and no personal data were exchanged between the researchers who had access to the corpus. Patients were informed in detail of their rights and of the possibility of leaving the study at any time. All patients and controls were native Spanish speakers and followed the same experimental protocol. The otolaryngologist who performed the surgeries was the only person who had contact with the patients and was also in charge of collecting the clinical data.
Fig. 4 The free monologue is guided by asking the patient to describe the scenes presented in this illustration.

Data records
The data are available through the Zenodo repository 36. The structure of the dataset is shown in Fig. 5. The total size of the dataset is 7.08 GB, comprising a collection of 3,800 audio files. The data are distributed across four distinct folders, namely audio features, audios, metadata, and clinical data. The following sections give a detailed overview of the contents of each of these folders.
Voice Quality Features repository. The "Audio features" folder contains all precomputed voice quality features. These features are available exclusively for the sustained vowel /a/. They are organised into three different .csv files named as follows: (i) the first session, which occurred 2 weeks before surgery, is labelled as "Ses1.csv"; (ii) the second session, which took place 2 weeks after surgery, is named "Ses2.csv"; and (iii) the third session, which was 3 months after surgery, is indicated as "Ses3.csv". Each .csv file comprises 35 columns: the initial column indicates the patient ID, while the remaining 34 columns represent each of the precomputed voice quality features mentioned above.
Audio recordings Bank. The "Audios" folder contains all audio materials, including their formant and antiformant trajectories. This folder is further organised into four subfolders, each corresponding to a specific surgical procedure: "Contr", "FESS", "Sept", and "Tonsill". Within these subfolders, 5 subfolders are found.
Taking the "Sept" folder as an example, it contains 4 sub-subfolders, which are organised as follows:
• Folder "Vowels" contains sub-folders "A", "E", "I", "O", and "U". These subfolders contain one-second utterances of each vowel. Each of these folders is divided into subfolders labelled "1", "2", and "3", which correspond to the three recording sessions. Each audio file and its corresponding images and features follow the structure:
Fig. 5 Database tree structure: hierarchical representation of the data records and their organization within the database. The structure displayed for the "Sept" folder is also applicable to the "Contr", "FESS", and "Tonsill" folders, but has been omitted for the sake of simplicity.
All audios were also manually edited to remove personal information (e.g., patients stating their names), as well as to eliminate word repetitions during the reading process, coughing, and background noises during silence periods.
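The vowel part of the folder layout described above can be enumerated with a short sketch. The cohort, vowel, and session folder names come from the description of the "Audios" folder. The location of the root folder on disk and the function name are assumptions, and individual file names are deliberately not constructed.

```python
from pathlib import PurePosixPath

COHORTS = ("Contr", "FESS", "Sept", "Tonsill")
VOWELS = ("A", "E", "I", "O", "U")
SESSIONS = ("1", "2", "3")


def vowel_session_dirs(root="Audios"):
    """Yield the expected directory for every cohort, vowel, and session."""
    for cohort in COHORTS:
        for vowel in VOWELS:
            for session in SESSIONS:
                yield PurePosixPath(root, cohort, "Vowels", vowel, session)


dirs = list(vowel_session_dirs())
print(len(dirs))   # 4 cohorts x 5 vowels x 3 sessions
print(dirs[0])
```

Such a listing can be compared against a local copy of the corpus to check that the download is complete before any processing starts.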

Clinical Data repository.
A set of clinical data was also recorded for each patient and session.
The "Clinical Data" folder contains the clinical data recorded following the protocol. It comprises three .csv files, one for each session, labelled as follows: (i) "Ses1.csv" for the first session; (ii) "Ses2.csv" for the second; and (iii) "Ses3.csv" for the third session.
Each .csv file is structured around 28 columns: the initial column represents the patient's ID; the second column specifies the surgical procedure; and the subsequent 12 columns include detailed clinical data as outlined in the methodology. The next 13 columns, one for each audio utterance, contain the file paths to all of the patient's audio recordings for the corresponding session. Furthermore, each file contains demographic data for each patient, with the last 14 columns indicating the demographic data recorded according to the methodology outlined in the protocol.
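Because the three session files share the patient ID column, a measurement can be compared across sessions by joining the files on that column. The following sketch illustrates the idea with in-memory stand-ins for "Ses1.csv" and "Ses3.csv". The column names ("ID", "Surgery", "Nasalance"), the patient codes, and the values are hypothetical and should be replaced by the headers found in the distributed files.

```python
import csv
import io

# Tiny stand-ins for "Ses1.csv" and "Ses3.csv"; column names and values
# are hypothetical and only illustrate the join.
SES1 = "ID,Surgery,Nasalance\nP01,Sept,20.0\nP02,Contr,30.0\n"
SES3 = "ID,Surgery,Nasalance\nP01,Sept,26.5\nP02,Contr,30.0\n"


def read_by_id(text, id_column="ID"):
    """Index the rows of one session file by patient ID."""
    return {row[id_column]: row for row in csv.DictReader(io.StringIO(text))}


def change_between_sessions(first, last, column):
    """Return last-minus-first differences for patients present in both."""
    changes = {}
    for patient_id, row in first.items():
        other = last.get(patient_id)
        # Skip patients missing from the later session or with empty cells.
        if other and row[column] and other[column]:
            changes[patient_id] = float(other[column]) - float(row[column])
    return changes


delta = change_between_sessions(read_by_id(SES1), read_by_id(SES3), "Nasalance")
print(delta)
```

To use the real files, the strings can be replaced by the contents of the corresponding .csv files read from the "Clinical Data" folder.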
Metadata Bank. The "Metadata" folder contains a file with the doctor's comments on each audio file.
The "Audio_comments.csv" file contains the doctor's comments (in Spanish) for each audio (when available), such as "the recording session has noise" or "the patient said 'del' instead of 'de'". It also includes a flag that indicates whether the audio required manual editing, for example, when the patient said their name. These comments are annotations and clarifications for each audio file. The .csv file consists of 27 columns, with the initial column indicating the patient's ID and the subsequent 26 columns containing comments for each audio material. For the five vowel sounds, namely /a/, /e/, /i/, /o/, and /u/, the comments are combined into a single column named "aeiou".
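One possible use of the comments file is to list, for each patient, the audio materials that carry an annotation, so that noisy or edited recordings can be reviewed or excluded. The sketch below works on an invented excerpt. Only the "aeiou" column name is taken from the description above, while "ID", "a1", the patient codes, and the comment text are hypothetical.

```python
import csv
import io

# Hypothetical excerpt: only "aeiou" is a column name given in the text;
# "ID", "a1", and the comment strings are invented for illustration.
COMMENTS = (
    "ID,a1,aeiou\n"
    "P01,,the recording session has noise\n"
    "P02,,\n"
)


def commented_recordings(text, id_column="ID"):
    """Map each patient ID to the audio materials that carry a comment."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        flagged = [
            column
            for column, value in row.items()
            if column != id_column and value and value.strip()
        ]
        if flagged:
            result[row[id_column]] = flagged
    return result


flagged = commented_recordings(COMMENTS)
print(flagged)
```

Patients without any comment are simply absent from the result, which keeps the output short for a corpus in which most recordings need no annotation.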

Technical Validation
The authors have extensively explored different subsets of the corpus, which have been the basis for several publications 24,27,28.
The technical validation of the described dataset was carried out by analysing variations in clinical data between the control and surgery groups, before and after surgical procedures. Statistical tests were employed, including Student's t-test for normally distributed quantitative features, the Wilcoxon test for non-normally distributed quantitative features, and the Fisher exact test for categorical variables. Table 6 reports statistically significant variations in subjective measurements (GRBAS and questionnaire) and objective nasalance measurements, highlighting the differences between the control and pathological groups.
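As an illustration of the categorical comparison, the following sketch computes a two-sided Fisher exact p-value for a 2 x 2 table using only the standard library. It follows one common convention, summing the hypergeometric probabilities of all tables that are no more likely than the observed one. The counts are invented and are not taken from the corpus, and the function is not the authors' implementation.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed
    # one; the small tolerance guards against floating-point ties.
    return sum(
        p
        for p in map(prob, range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Invented counts: rows are two groups, columns are
# "GRBAS = 0" versus "GRBAS > 0".
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(round(p_value, 4))
```

For these invented counts the p-value is 400/11440, roughly 0.035, which would be significant at the 0.05 level.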
Subsequently, a detailed examination of the variations between sessions in objective nasalance, subjective self-assessment nasality questionnaires, and GRBAS values was conducted. This analysis revealed notable differences between sessions, as described in Table 7. In the Contr. group, no significant changes were observed in objective or subjective measurements. However, a statistically significant variation was identified in the subjective measurement of nasality between the first and last sessions for both the FESS. and Sept. groups, with p-values < 0.05 and < 0.001, respectively (Fig. 2). Significant variations in GRBAS measurements are evident in all surgical groups, each with p-values < 0.05. This underscores the discernible differences between the control and surgery voices. A notable trend, as illustrated in Fig. 3, is the consistent increase of null values in GRBAS measurements for all surgical procedures, indicating that the patients' voices are improving, compared to the constant trend observed in the Contr. group.
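The trend of null GRBAS values discussed above amounts to computing, for each session, the percentage of patients whose score equals zero. A minimal sketch follows. The scores are invented, and the function and variable names are hypothetical.

```python
def percent_null(scores):
    """Percentage of scores equal to zero (non-pathological GRBAS)."""
    if not scores:
        raise ValueError("no scores for this session")
    return 100.0 * sum(1 for s in scores if s == 0) / len(scores)


# Invented GRBAS scores for one cohort across the three sessions.
grbas_by_session = {
    "Ses1": [0, 1, 1, 0],
    "Ses2": [0, 0, 1, 0],
    "Ses3": [0, 0, 0, 0],
}
trend = {
    session: percent_null(scores)
    for session, scores in grbas_by_session.items()
}
print(trend)
```

An increasing percentage across sessions corresponds to the improvement pattern reported for the surgical cohorts, while a flat percentage corresponds to the pattern reported for the control group.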

Audio signal: L × 1

Usage Notes
The Python data management scripts are available in the GitHub® repository cited in the Code Availability section.
For practical guidance and hands-on demonstration, the corpus contains a Jupyter® notebook named "Usage_notes.ipynb". This notebook includes a comprehensive code section that illustrates the process of reading audio files and normalising the data. Additionally, it features a simple yet illustrative experiment, in which Mel-Frequency Cepstral Coefficients are computed for the /a/ sustained vowels corresponding to the first session, and a simple Random Forest classifier is trained to distinguish between the control and pathological groups. This experiment achieves 74% accuracy and provides a practical illustration of how to work with the dataset, serving as a starting point for further exploration and analysis. It does not aim to maximise classification accuracy.
A Nasal Questionnaire
This section presents the subjective self-assessment nasality questionnaire used. Each item has five possible answers, each scored with a value from 1 to 5. The total score assigned to each patient is calculated as the sum of the partial scores for each item evaluated.

Fig. 1 Use of the Nasometer.The device measures the nasality in percentage as a quotient of the nasal and oral energies.

Fig. 6 Formant and antiformant trajectories estimated with the KARMA algorithm for the sustained vowel /a/, first session, for a Contr. patient (a) and a FESS patient (b).

Table 1. Percentage of missing demographic data by cohort.

Table 2. Weight, nasality questionnaire, and nasometry statistics in terms of mean ± std for the different cohorts and sessions.

Table 4. Voice measurements extracted from the recordings of the /a/ sustained vowels.

Table 5. Dictionary contents for any SurgName_SessionNumber_AudioMaterial_IDPatient.pkl file, where L is the length of the audio file and W the number of windows. The entries of the "params" key are: peCoeff (pre-emphasis coefficient), windowType (window type, "Hamming"), windowSizems (window length in milliseconds), windowOverlap (window overlap fraction), lpcOrder (number of AR coefficients), zOrder (number of MA coefficients), fs (downsampling frequency, in Hz), cepOrder (number of cepstral coefficients), cepType (type of cepstral coefficients, 1 for ARMA), and algFlag (algorithm flag, 2 for extended Kalman smoother). Adjusting these parameters fine-tunes the algorithm for the analysis of the audio data.
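The file naming pattern in the Table 5 caption can be split into its four fields with a short sketch. It assumes that the fields never contain underscores themselves and that the session number is an integer. Both are assumptions about the distributed files, and the example file name is hypothetical.

```python
from collections import namedtuple
from pathlib import PurePosixPath

FeatureFile = namedtuple("FeatureFile", "surgery session material patient_id")


def parse_feature_filename(path):
    """Split SurgName_SessionNumber_AudioMaterial_IDPatient.pkl into fields."""
    stem = PurePosixPath(path).stem
    parts = stem.split("_")
    # Assumes exactly four underscore-separated fields.
    if len(parts) != 4:
        raise ValueError(f"unexpected file name: {path!r}")
    surgery, session, material, patient_id = parts
    return FeatureFile(surgery, int(session), material, patient_id)


# Hypothetical file name following the pattern in the caption.
info = parse_feature_filename("Sept_1_A_0042.pkl")
print(info)
```

Keeping the patient ID as a string preserves any leading zeros, which matters when the ID is later matched against the ID column of the .csv files.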

Table 6. Statistical comparison of each surgery group against the Contr. group through p-values before the surgical procedure, i.e., Ses. 1. The numbers indicate the total number of occurrences, with percentages in parentheses.

Table 7. Intra-group variation of GRBAS, nasality questionnaire, and nasalance measurements by session. The p-value is calculated using Student's t-test for quantitative variables and the Fisher exact test for categorical ones. The numbers indicate the total number of occurrences, with percentages in parentheses.